Improving Inter-thread Data Sharing with GPU Caches

ثبت نشده
چکیده

The massive amount of fine-grained parallelism exposed by a GPU program makes it difficult to exploit shared cache benefits even there is good program locality. The non deterministic feature of thread execution in the bulk synchronize parallel (BSP) model makes the situation even worse. Most prior work in exploiting GPU cache sharing focuses on regular applications that have linear memory access indices. In this paper, we formulate a generic workload partitioning model that systematically exploits the complexity and approximation bound for optimal cache sharing among GPU threads. Our exploration in this paper demonstrates that it is possible to utilize GPU cache efficiently without significant programming overhead or ad-hoc application-specific implementation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ACACES A Decoupled Access/Execute Architecture for Mobile GPUs

Smartphones are emerging as one of the fastest growing markets, providing enhanced capabilities every few months. However, supporting these hardware/software improvements comes at the cost of reducing the operating time per battery charge. The GPU is only left with a shrinking fraction of the power budget, but the trend towards better screens will inevitably lead to a higher demand for improved...

متن کامل

A Cache-aware Thread Scheduling Policy for Multi-core Processors

A modern high-performance multi-core processor has large shared cache memories. However, simultaneously running threads do not always require the entire capacities of the shared caches. Besides, some threads cause severe performance degradation by inter-thread cache conflicts and shortage of capacity on the shared cache. To achieve high performance processing on multi-core processors, effective...

متن کامل

Architectural support for thread communications in multi-core processors

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving singlethreaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorp...

متن کامل

Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis

Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly ...

متن کامل

Data Movement Control for the PowerPC Architecture

There is an inherent cost for applications to access off-chip DRAM. A potential solution is a single large on-chip cache that all cores access with uniform latency. This architecture makes it easy for application developers to implement efficient inter-core sharing and use the entire on-chip cache. A large shared on-chip cache, however, is still prohibitively slow. Architects ensure each core h...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014